
Fix cluster restart failure after bootstrap with built-in log service #266

Draft

Copilot wants to merge 6 commits into main from copilot/fix-cluster-start-failure

Conversation


Copilot AI commented Dec 5, 2025

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the RFC link if one exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Description

During bootstrap, the log service's raft groups undergo multiple leader elections. On restart, tx_service initializes with a stale leader cache, causing connection failures when it contacts the log groups.

Changes

Added proactive leader discovery after log_agent initialization in Sharder::Init():

if (log_agent_ != nullptr)
{
    log_agent_->Init(txlog_ips_, txlog_ports_, 0);
    
    // Refresh leader info for all log groups
    uint32_t log_group_count = log_agent_->LogGroupCount();
    if (log_group_count == 0)
    {
        LOG(WARNING) << "No log groups found after log agent initialization. "
                        "Log service may not be properly configured.";
    }
    else
    {
        LOG(INFO) << "Refreshing leader info for " << log_group_count
                  << " log group(s) after initialization";
        for (uint32_t lg_id = 0; lg_id < log_group_count; ++lg_id)
        {
            log_agent_->RefreshLeader(lg_id);
            DLOG(INFO) << "Refreshed leader for log group " << lg_id;
        }
    }
}

Key points:

  • RefreshLeader() queries current leader from each log group immediately after initialization
  • Prevents connection failures from stale leader cache on restart
  • Added validation for zero log groups to detect configuration issues
  • Operation is lightweight (cache update only); error handling delegated to log service layer

Impact

Resolves connection failures during cluster restart after bootstrap. Part of coordinated fix with log_service #27, eloq_log_service #33, eloqdoc #345, eloqkv #329, eloqsql #177.

Original prompt

This section details the original issue you should resolve

<issue_title>[Bug]: Fix cluster start failure after bootstrap with built-in log service</issue_title>
<issue_description>- Software version (the version of the software you're running):(required)
e.g. 0.0.1

  • Reproduce steps (describe steps of reproducing the bug):(required)

    • Bootstrap an eloqdoc cluster with a 3-node config and the built-in log service.
    • Restarting the eloqdoc cluster fails with a raft log sync error.
    • Leader changes in the log service raft group caused the issue.
  • Expected result (describe the right result as expected):(required)

  • Actual result (describe the incorrect result actually happened):(required)

  • Related issues (references to the related issues, e.g. eloqdb/repository#number):(optional)
    </issue_description>

Comments on the Issue (you are @copilot in this section)

@githubzilla

Log Service Raft Group Leader Change Analysis

Cluster Configuration

  • Log Group (lg0): Single log group with 3 nodes
  • Node 0: 127.0.0.1:16381
  • Node 1: 127.0.0.1:17381
  • Node 2: 127.0.0.1:18381
  • Group Configuration: 127.0.0.1:16381:0:0,127.0.0.1:17381:0:0,127.0.0.1:18381:0:0

Timeline of Leader Changes

Phase 1: Initial Startup (Single Node)

Time: 15:11:18.727 - Node 0 becomes leader (Term 3)

  • Node 0 starts as a single-node group
  • Becomes leader with term 3
  • Configuration: Only 127.0.0.1:16381:0:0
  • Log Entry (tx_log_0.log:39):
    node lg0:127.0.0.1:16381:0:0 term 3 become leader of group 127.0.0.1:16381:0:0
    
  • Log Entry (tx_log_0.log:53):
    Log node 127.0.0.1:16381 becomes the leader of the log group eloqdata/tx_service#0, term: 3, leader lease valid? yes
    

Phase 2: Cluster Expansion - Node 1 Joins

Time: 15:11:30.818 - Node 1 becomes leader (Term 4)

  • Node 0 was leader but receives PreVote/RequestVote from Node 1 with term 4
  • Node 0 steps down: "stops being the leader" - reason: "Raft node receives higher term request_vote_request"
  • Node 1 wins election and becomes leader
  • Log Entry (tx_log_0.log:247):
    Log node 127.0.0.1:16381 becomes the follower of the group eloqdata/tx_service#0, term: 4
    
  • Log Entry (tx_log_1.log:161):
    node lg0:127.0.0.1:17381:0:0 term 4 become leader of group 127.0.0.1:16381:0:0,127.0.0.1:17381:0:0,127.0.0.1:18381:0:0
    
  • Log Entry (tx_log_1.log:166):
    Log node 127.0.0.1:17381 becomes the leader of the log group eloqdata/tx_service#0, term: 4, leader lease valid? yes
    
  • Leader Refresh (tx_log_0.log:263):
    Refresh log group:0 leader from node_id: 0 to node_id: 1
    
  • Leader Refresh (tx_log_1.log:182):
    Refresh log group:0 leader from node_id: 0 to node_id: 1
    

Observation: Node 0 had log entries up to index 11 from term 2, but Node 1's term 4 entries start from index 1, causing log mismatch issues when Node 1 tries to replicate.

Phase 3: Leadership Transfer Request

Time: 15:11:35.837 - Leadership Transfer Initiated

  • Node 1 (current leader) starts transferring leadership to Node 0
  • Log Entry (tx_log_1.log:183):
    node lg0:127.0.0.1:17381:0:0 starts to transfer leadership to 127.0.0.1:16381:0:0
    
  • Log Entry (tx_log_1.log:184):
    Log node 127.0.0.1:17381 stops being the leader of the log group eloqdata/tx_service#0, status: Raft leader is transferring leadership to 127.0.0.1:16381:0:0
    
  • Node 0 attempts to become leader through election

Phase 4: Node 0 Wins Election

Time: 15:11:37.827 - Node 0 becomes leader (Term 5)

  • After leadership transfer, Node 0 wins election with term 5
  • Log Entry (tx_log_0.log:274):
    node lg0:127.0.0.1:16381:0:0 term 5 become leader of group 127.0.0.1:16381:0:0
    
  • Log Entry (tx_log_0.log:276):
    Log node 127.0.0.1:16381 becomes the leader of the log group eloqdata/tx_service#0, term: 5, leader lease valid? yes
    
  • Leader Refresh (tx_log_0.log:285):
    Update log group:0 leader to node_id:0
    

Issue: Node 0 still has log entries from term 2 (index 13), but Node 2 rejects AppendEntries because of term mismatch (Node 2 expects term 6 entries but Node 0 has term 2 entries).

Phase 5: Node 2 Wins Election

Time: 15:11:39.785 - Node 2 becomes leader (Term 6)

  • Node 2 initiates election with term 6
  • Node 0 receives PreVote/RequestVote from Node 2, steps down
  • Log Entry (tx_log_0.log:303):
    Log node 127.0.0.1:16381 stops being the leader of the log group eloqdata/tx_service#0, status: Raft node receives higher term request_vote_request.
    
  • Log Entry (tx_log_0.log:308):
    Log node 127.0.0.1:16381 becomes the follower of the group eloqdata/tx_service#0, term: 6
    
  • Log Entry (tx_log_1.log:198):
    Log node 127.0.0.1:17381 becomes the follower of the group eloqdata/tx_service#0, term: 6
    
  • Log Entry (tx_log_2.log:17...



This fix ensures that after initializing the log agent, we immediately
refresh the leader information for all log groups. This is critical after
cluster restart scenarios where log group leaders may have changed during
the previous session (e.g., during bootstrap).

The RefreshLeader call will query the current leader from each log group,
updating the leader cache with accurate information before any log
operations begin. This prevents connection failures that occur when
trying to communicate with stale/incorrect leader nodes.

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


Copilot AI and others added 4 commits December 5, 2025 06:23
Added INFO and DEBUG level logging to track leader refresh operations
for better observability and debugging. This helps verify that leader
discovery is working correctly during cluster initialization and
restart scenarios.

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
- Add warning when no log groups are found to detect configuration issues
- Add comment explaining RefreshLeader is lightweight and handles errors
- Improve code robustness and observability

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
- Fix multi-line string formatting for better readability
- log_group_count is already stored in a variable (no redundant calls)
- Improved code consistency

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
The _codeql_detected_source_root file is a temporary artifact from
security scanning and should not be committed to the repository.

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix cluster start failure after bootstrap with log service Fix cluster restart failure after bootstrap with built-in log service Dec 5, 2025
Copilot AI requested a review from githubzilla December 5, 2025 06:29


Successfully merging this pull request may close these issues.

[Bug]: Fix cluster start failure after bootstrap with built-in log service
